Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation
Myriad graph-based algorithms in machine learning and data mining require
parsing relational data iteratively. These algorithms are implemented in a
large-scale distributed environment in order to scale to massive data sets. To
accelerate these large-scale graph-based iterative computations, we propose
delta-based accumulative iterative computation (DAIC). Different from
traditional iterative computations, which iteratively update the result based
on the result from the previous iteration, DAIC updates the result by
accumulating the "changes" between iterations. With DAIC, we can process only the
"changes" and skip negligible updates. Furthermore, we can perform DAIC
asynchronously to bypass the high-cost synchronous barriers in heterogeneous
distributed environments. Based on the DAIC model, we design and implement an
asynchronous graph processing framework, Maiter. We evaluate Maiter on a local
cluster as well as on the Amazon EC2 cloud. The results show that Maiter achieves
as much as 60x speedup over Hadoop and outperforms other state-of-the-art
frameworks.
Comment: ScienceCloud 2012, TKDE 201
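The delta-accumulation idea in the abstract above can be illustrated with a minimal single-machine sketch. PageRank is used here only as an assumed example workload, and the function names, the damping factor of 0.8, and the tolerance are hypothetical choices, not Maiter's API. Each vertex keeps an accumulated value plus a pending "change"; processing a vertex folds its pending change into its value and forwards a damped share to its out-neighbours, and vertices can be visited in any order.

```python
def delta_pagerank(out_edges, damping=0.8, tol=1e-10):
    """Delta-based accumulative PageRank (illustrative sketch).

    out_edges maps every vertex to a list of out-neighbours; every
    neighbour must itself be a key of the dict.
    """
    value = {v: 0.0 for v in out_edges}            # accumulated result
    delta = {v: 1.0 - damping for v in out_edges}  # pending "change"
    active = True
    while active:
        active = False
        for v in out_edges:          # any visiting order is acceptable
            d = delta[v]
            if d <= tol:             # skip negligible updates
                continue
            active = True
            value[v] += d            # accumulate the change
            delta[v] = 0.0
            targets = out_edges[v]
            if targets:
                share = damping * d / len(targets)
                for u in targets:    # forward the change to neighbours
                    delta[u] += share
    return value


def iterative_pagerank(out_edges, damping=0.8, iterations=200):
    """Traditional iteration: recompute every rank from the previous round."""
    rank = {v: 1.0 - damping for v in out_edges}
    for _ in range(iterations):
        new_rank = {v: 1.0 - damping for v in out_edges}
        for v, targets in out_edges.items():
            for u in targets:
                new_rank[u] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
delta_result = delta_pagerank(graph)
classic_result = iterative_pagerank(graph)
```

On this toy graph both functions approach the same fixed point; the delta-based version never touches a vertex whose pending change is below the tolerance, and it needs no barrier between rounds.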
Quantifying AS Path Inflation by Routing Policies
A route in the Internet may take a longer AS path than the shortest AS path due to routing policies. In this paper, we systematically analyze AS paths and quantify the extent to which routing policies inflate AS paths. The results show that AS path inflation in the Internet is more prevalent than expected. We first present the extent of AS path inflation observed from the RouteView and RIPE routing tables. We then employ three common routing policies to show the extent of AS path inflation. We find that the No-Valley routing policy causes the least AS path inflation among the three routing policies, while the PreferCustomer-and-Peer-over-Provider policy causes the most. In addition, we find that single-homed stub ASes experience more path inflation than transit ASes and multi-homed ASes. The AS pairs whose shortest AS path is 3 AS hops experience more path inflation than other AS pairs. Finally, we investigate AS path inflation on the end-to-end paths from end users to two popular content providers, Google and Comcast. Although the majority of the shortest AS paths from end users to the two providers consist of no more than three AS hops, the actual end-to-end paths that the traffic takes are longer than the shortest AS paths in many cases. Quantifying AS path inflation in the Internet has important implications for the extent of routing policies, traffic engineering performed on the Internet, and BGP convergence speed.
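As a rough illustration of the quantity discussed in the abstract above, the sketch below measures inflation as the hop count of an observed AS path minus the hop count of the shortest path in the AS-level graph. It assumes that one hop is one inter-AS link; the topology, the private-use AS numbers, and the function names are hypothetical and are not taken from the paper.

```python
from collections import deque


def shortest_hops(adjacency, src, dst):
    """Breadth-first search for the minimum number of AS-level links."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # destination unreachable


def path_inflation(adjacency, as_path):
    """Extra hops of an observed AS path relative to the shortest AS path."""
    observed = len(as_path) - 1
    shortest = shortest_hops(adjacency, as_path[0], as_path[-1])
    if shortest is None:
        raise ValueError("endpoints are not connected in the topology")
    return observed - shortest


# Hypothetical undirected AS-level topology (private-use AS numbers).
topology = {
    64512: [64513, 64515],
    64513: [64512, 64514],
    64514: [64513, 64516],
    64515: [64512, 64516],
    64516: [64515, 64514],
}
observed_path = [64512, 64515, 64516, 64514]  # path chosen by policy
inflation = path_inflation(topology, observed_path)
```

Here the observed path has three hops while the shortest path via AS 64513 has two, so the path is inflated by one hop.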
CSD: Discriminance with Conic Section for Improving Reverse k Nearest Neighbors Queries
The reverse nearest neighbor (RNN) query finds all points that have
the query point as one of their nearest neighbors (NN), where the NN
query finds the closest points to its query point. Based on the
characteristics of conic sections, we propose a discriminance, named CSD (Conic
Section Discriminance), to determine whether points belong to the RNN set
without issuing any queries with non-constant computational complexity. By
using CSD, we also implement an efficient RNN algorithm CSD-RNN with a
computational complexity at . The comparative
experiments are conducted between CSD-RNN and two other state-of-the-art
RkNN algorithms, SLICE and VR-RNN. The experimental results indicate that
the efficiency of CSD-RNN is significantly higher than that of its competitors.
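To make the query definition above concrete, here is a brute-force baseline. It is not the CSD method and does not use conic sections; it only applies the definition directly, and the function name and sample points are hypothetical. A point p belongs to the result when fewer than k other data points lie strictly closer to p than the query point does.

```python
import math


def reverse_knn(points, query, k=1):
    """Brute-force reverse k-nearest-neighbour query over 2D points."""
    result = []
    for i, p in enumerate(points):
        d_query = math.dist(p, query)
        # Count data points (other than p) strictly closer to p than the query.
        closer = sum(
            1
            for j, other in enumerate(points)
            if j != i and math.dist(p, other) < d_query
        )
        if closer < k:  # the query is among p's k nearest neighbours
            result.append(p)
    return result


# Hypothetical data set on a line.
data = [(0, 0), (1, 0), (5, 0), (6, 0)]
rnn_k1 = reverse_knn(data, (2, 0), k=1)
rnn_k2 = reverse_knn(data, (2, 0), k=2)
```

This baseline compares every pair of points, which is the per-candidate verification cost that a constant-time discriminance aims to avoid.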
The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems
Now we live in an era of big data, and big data applications are becoming
more and more pervasive. How to benchmark data center computer systems running
big data applications (in short big data systems) is a hot topic. In this
paper, we focus on measuring the performance impacts of diverse applications
and scalable volumes of data sets on big data systems. For four typical data
analysis applications, an important class of big data applications, we find
two major results through experiments. First, the data scale has a significant
impact on the performance of big data systems, so we must provide scalable
volumes of data sets in big data benchmarks. Second, for the four applications,
even though all of them use simple algorithms, the performance trends
differ with increasing data scales, and hence we must consider not only
a variety of data sets but also a variety of applications in benchmarking big data
systems.
Comment: 16 pages, 3 figures
Making Networks Robust to Component Failures
In this thesis, we consider instances of component failure in the Internet and in networked cyber-physical systems, such as the communication network used by the modern electric power grid (termed the smart grid). We design algorithms that make these networks more robust to various component failures, including failed routers, failures of links connecting routers, and failed sensors. This thesis divides into three parts: recovery from malicious or misconfigured nodes injecting false information into a distributed system (e.g., the Internet), placing smart grid sensors to provide measurement error detection, and fast recovery from link failures in a smart grid communication network.
First, we consider the problem of malicious or misconfigured nodes that inject and spread incorrect state throughout a distributed system. Such false state can degrade the performance of a distributed system or render it unusable. For example, in the case of network routing algorithms, false state corresponding to a node incorrectly declaring a cost of 0 to all destinations (maliciously or due to misconfiguration) can quickly spread through the network. This causes other nodes to (incorrectly) route via the misconfigured node, resulting in suboptimal routing and network congestion. We propose three algorithms for efficient recovery in such scenarios and evaluate their efficacy.
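The routing example in the paragraph above can be sketched with a small synchronous distance-vector simulation. This illustrates only how the false state spreads, not any of the thesis's recovery algorithms; the topology, node names, and the `liar` parameter are hypothetical.

```python
INF = float("inf")


def distance_vector(nodes, links, liar=None):
    """Run synchronous distance-vector rounds and return next-hop tables.

    links maps undirected (u, v) pairs to a cost. If liar is set, that
    node advertises a cost of 0 to every destination.
    """
    neighbors = {n: {} for n in nodes}
    for (u, v), cost in links.items():
        neighbors[u][v] = cost
        neighbors[v][u] = cost
    dist = {n: {d: (0 if n == d else INF) for d in nodes} for n in nodes}
    next_hop = {n: {d: None for d in nodes} for n in nodes}
    for _ in range(len(nodes)):  # enough rounds for this small topology
        advertised = {
            n: ({d: 0 for d in nodes} if n == liar else dict(dist[n]))
            for n in nodes
        }
        for n in nodes:
            for d in nodes:
                if d == n:
                    continue
                best, hop = INF, None
                for nb, cost in neighbors[n].items():
                    candidate = cost + advertised[nb][d]
                    if candidate < best:
                        best, hop = candidate, nb
                dist[n][d], next_hop[n][d] = best, hop
    return next_hop


# Hypothetical chain A - B - C - D with an extra stub node M attached to A.
nodes = ["A", "B", "C", "D", "M"]
links = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("A", "M"): 1}
honest = distance_vector(nodes, links)
poisoned = distance_vector(nodes, links, liar="M")
```

With honest advertisements, A forwards traffic for D through B. Once M claims a cost of 0 to every destination, A switches to M, even though M has no path to D other than back through A.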
The last two parts of this thesis consider robustness in the context of the electric power grid. We study the use and placement of a sensor, called a Phasor Measurement Unit (PMU), currently being deployed in electric power grids worldwide. PMUs provide voltage and current measurements at a sampling rate orders of magnitude higher than the status quo. As a result, PMUs can both drastically improve existing power grid operations and enable an entirely new set of applications, such as the reliable integration of renewable energy resources. However, PMU applications require correct (addressed in thesis part 2) and timely (covered in thesis part 3) PMU data. Without these guarantees, smart grid operators and applications may make incorrect decisions and take corresponding (incorrect) actions.
The second part of this thesis addresses PMU measurement errors, which have been observed in practice. We formulate a set of PMU placement problems that aim to satisfy two constraints: place PMUs near each other to allow for measurement error detection and use the minimal number of PMUs to infer the state of the maximum number of system buses and transmission lines. For each PMU placement problem, we prove it is NP-Complete, propose a simple greedy approximation algorithm, and evaluate our greedy solutions.
In the last part of this thesis, we design algorithms for fast recovery from link failures in a smart grid communication network. We propose, design, and evaluate solutions to all three aspects of link failure recovery: (a) link failure detection, (b) algorithms for pre-computing backup multicast trees, and (c) fast backup tree installation.
To address (a), we design link-failure detection and reporting mechanisms that use OpenFlow to detect link failures when and where they occur inside the network. OpenFlow is an open source framework that cleanly separates the control and data planes for use in network management and control. For part (b), we formulate a new problem, Multicast Recycling, that pre-computes backup multicast trees that aim to minimize control plane signaling overhead. We prove Multicast Recycling is at least NP-hard and present a corresponding approximation algorithm. Lastly, for part (c), we propose two control plane algorithms that signal data plane switches to install pre-computed backup trees. We design an optimized version of each installation algorithm that finds a near-minimum set of forwarding rules by sharing forwarding rules across multicast groups. This optimization reduces backup tree install time and associated control state. We implement these algorithms using the POX open-source OpenFlow controller and evaluate them using the Mininet emulator, quantifying control plane signaling and installation time.